Philips Semiconductors                                  Application note AN703

XA benchmark versus the architectures 68000, 80C196, and 80C51

Author: Santanu Roy
1996 Mar 01

Background

A computer benchmark is a "program" that is used to determine relative computer core performance by evaluating the benchmark execution time on that core. In the brainstorm on microcontrollers for automotive applications, an assembler functional benchmark for engine management, which is a typical example of embedded high-end microcontrol, was created. This report gives worked-out routines of the functions as they would be implemented in the assembler language of the compared controllers: Motorola 68000, Intel 80C196, Philips 80C552 and Philips XA. The total execution times of a program "engine cycle" (engine stroke) are calculated and the required program code is estimated for each controller.

Evaluation of performance in a high-level language (HLL) like C would be preferable, but it is difficult to realize, as "the best" compilers for all cores involved would then have to be used.

This document is based on report number DPE88187. It outlines code density and execution times of the XA, based on the most recent information. The execution times are given in terms of both clock cycles and time units. Although the XA can run at 30 MHz @ 5.0 V, for the sake of fairness all cores are evaluated running at 16.00 MHz. This is reasonable for comparing the cores at the same level of technology.

A separate section is included in this benchmark for the "bit manipulation" function benchmark results only. This (bit-test) routine is a stand-alone one and should not be considered part of the engine management routine.

Benchmark results and conclusions

Relative performance on a line

The table below presents the most important result of the assembler benchmark evaluation. It pictures the relative performance of the compared core instruction sets on a scale where XA = 1.0.
Also appended are the performance charts (execution and code density) of all the processors.

Total execution times per core (µs) for all routines (with * occurrences):

                       8051      68000     80C196     XA
  Total time (µs)      5,942     1,560     1,089.24   402.6

  Performance ratio    8051      68000     80C196     XA
  8051                 1.0       3.81      5.45       14.7
  68000                0.34      1.0       1.43       3.85
  80C196               0.18      0.7       1.0        2.7
  XA                   0.068     0.26      0.37       1.0

Table 1. XA instruction set execution times and bytes/function

  Function   Oc*   Exec. time/funct. (µs)   Oc* x time/funct. (µs)   Bytes/function
  mpy        12    0.75                     9                        2
  fdiv       4     3.94                     15.8                     18
  add/sub    50    0.38                     19                       4
  cmp 24b    13    1.06                     13.78                    9
  CAN 16b    40    0.563                    22.52                    5
  intplin    20    1.98                     41.3                     14
  interr     10    6.1                      61                       41
  branch     10                             153.1

  XA totals:                  335.5 µs
  Including 20% statistics:   402.6 µs
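The performance ratio table above can be cross-checked from the four total times alone. The following is a minimal sketch, assuming each entry is the row core's total time divided by the column core's total time; the names `totals_us` and `performance_ratio` are illustrative and not part of the original note.

```python
# Total execution times (µs, including the 20% statistics overhead),
# copied from the totals row above.
totals_us = {"8051": 5942.0, "68000": 1560.0, "80C196": 1089.24, "XA": 402.6}


def performance_ratio(row: str, col: str) -> float:
    """Total time of the row core divided by total time of the column core.

    This interpretation of the table is an assumption; it reproduces most
    of the printed entries.
    """
    return totals_us[row] / totals_us[col]


if __name__ == "__main__":
    cores = list(totals_us)
    for row in cores:
        print(row, [round(performance_ratio(row, col), 3) for col in cores])
```

Under this assumption the recomputed values agree with the printed ones to within rounding, except for the 68000 row, 8051 column, where the quotient 1560/5942 comes out near 0.26 rather than the printed 0.34.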
Table 2. 68000 instruction set execution times and bytes/function

  Function   Oc*   Exec. time/funct. (µs)   Oc* x time/funct. (µs)   Bytes/function
  mpy        12    4.4                      52.8                     2
  fdiv       4     13.4                     53.6                     16
  add/sub    50    2.75                     137.5                    12
  cmp 24b    13    3.2                      41.6                     14
  CAN 16b    40    2.7                      108                      14
  intplin    20    7.5                      150                      14
  interr     10    21.9                     219                      92
  branch     10                             537.5

  68000 totals:               1,300 µs
  Including 20% statistics:   1,560 µs

Table 3. 80C196 instruction set execution times and bytes/function

  Function   Oc*   Exec. time/funct. (µs)   Oc* x time/funct. (µs)   Bytes/function
  mpy        12    1.75                     21                       3
  fdiv       4     9.5                      38                       19
  add/sub    50    1.25                     62.5                     7
  cmp 24b    13    4.25                     55.2                     14
  CAN 16b    40    2.5                      100                      6
  intplin    20    6.4                      128                      18
  interr     10    12.8                     128                      58
  branch     10                             375

  80C196 totals:              907.7 µs
  Including 20% statistics:   1,089.24 µs

Table 4. 8051 instruction set execution times and bytes/function

  Function   Oc*   Exec. time/funct. (µs)   Oc* x time/funct. (µs)   Bytes/function
  mpy        12    37.5                     450                      58
  fdiv       4     451.5                    1806                     96
  add/sub    50    7.5                      375                      19
  cmp 24b    13    9.98                     129.74                   22
  CAN 16b    40    9                        360                      14
  intplin    20    25.8                     516                      20
  interr     10    31.5                     315                      70
  branch     10                             1000

  8051 totals:                4,951.74 µs
  Including 20% statistics:   5,942 µs
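The totals under Tables 1 to 4 follow from summing the "Oc* x time/funct." column and applying the 20% statistics factor. The sketch below reproduces that arithmetic with the values taken from the four tables; the names `weighted_times_us` and `core_totals` are illustrative and do not appear in the original note.

```python
STATISTICS_FACTOR = 1.2  # 20% statistics overhead, as stated in the note

# "Oc* x time/funct." column (µs) from Tables 1 to 4, in the order
# mpy, fdiv, add/sub, cmp 24b, CAN 16b, intplin, interr, branch.
weighted_times_us = {
    "XA": [9, 15.8, 19, 13.78, 22.52, 41.3, 61, 153.1],
    "68000": [52.8, 53.6, 137.5, 41.6, 108, 150, 219, 537.5],
    "80C196": [21, 38, 62.5, 55.2, 100, 128, 128, 375],
    "8051": [450, 1806, 375, 129.74, 360, 516, 315, 1000],
}


def core_totals(core: str) -> tuple[float, float]:
    """Return (sum of weighted function times, same sum including statistics)."""
    subtotal = sum(weighted_times_us[core])
    return subtotal, subtotal * STATISTICS_FACTOR


if __name__ == "__main__":
    for core in weighted_times_us:
        subtotal, total = core_totals(core)
        print(f"{core}: {subtotal:.2f} µs, with statistics {total:.2f} µs")
```

Keeping the per-function products in one place makes it easy to see how a change in a single occurrence estimate would shift the total for each core.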
Table 5. Total benchmark execution time results

  Microcontroller core   Execution time (µs)
  XA                     402.6
  68000                  1560
  80C196                 1089.24
  8051                   5942

As the total activity has to be completed in one machine stroke of 2 ms, the XA and the 80C196 will be able to meet the application requirements. The 80C552 was originally assumed to complete the functions over more than one stroke.

The best efficiency is that of the XA and the 80C196. The 80C196 includes 3-parameter instructions that reduce the instruction count per function, and it has JB/JBN instructions. It also uses half-word (1-byte) codes for frequently used instructions.

The lower code efficiency of the 8051 instruction set can mainly be explained by the "accumulator bottleneck", which is not present in the XA: most data has to be transported to and from the accumulator before add/sub/cmp can be done, and operations on words require 4 "mov" instructions and 2 data execution instructions. The efficient JB and JBN instructions compensate for this to a large extent.

Benchmark limitations

Like all benchmarks, the automotive engine management assembler functional benchmark has some weaknesses that limit the validity of its results.

1. Control in a special (automotive, engine) environment is evaluated.
2. Occurrences of operation overheads are based on estimations.
3. Occurrences of functions are based on estimations.
4. Functions are implemented in assembler, not in an HLL like C.
5. Routines may contain assembler implementation errors.
6. All cores are evaluated at 16.0 MHz.

Control in a special environment is evaluated (automotive, engine)

The core performance evaluation is based on a single specialized case. All benchmark implementations are fractions of the automotive engine management PCB83C552 demonstration program.
It can be argued that the automotive engine control task is a good example of a typical highly demanding control environment, where many calculations of 16 bits or more have to be done.

Occurrences of overheads are based on estimations

The assembler functional benchmark is not a full implementation of a program. Arbitrarily choosing the location for storage of parameters in the register file or in (external) memory, for instance, has a considerable effect on the total execution time for some instruction sets. For the different cores, parameter storage is chosen, where possible, using the core facilities that give minimum access overhead.

Occurrences of functions are based on estimations

Occurrences are estimated on the basis of the experience of the automotive group. In a real implementation of an engine controller the emphasis may shift. As most functions already include some "instruction mix", the effect of changes in occurrences is limited.

Functions are implemented in assembler, not in an HLL like C

Control programs for embedded systems get larger, have to provide more facilities, and have to be realized in shorter development times. The only way to do this is to program in an HLL like C. Efficient C-language program implementation requires different features from microcontrollers than assembly programs do. The results of this assembler benchmark evaluation therefore have restricted value for ranking microcontroller performance for future HLL applications.

Benchmark ranking on the basis of an HLL like C requires good C compilers for all the devices involved. The quality of the C compilers really has to be the best there is: HLL benchmarking measures not only the micro characteristics but, even more, the compiler's ability to use these qualities. As such compilers are not available for all the micros evaluated, all routines are worked out only in assembly.
Routines may contain assembler implementation errors

The assembler routine implementations were made after a short study of the micro specifications and have not been checked by assembling or by debugging in a real hardware environment. It can be said rather safely that a complete system setup and program debug to correct errors would not lead to considerable differences in the performance results. Deviations in function occurrences and overheads may have a more significant effect on the performance ratios.

All cores are evaluated at 16.0 MHz

A 16.0 MHz internal clock frequency seems a reasonable choice for comparing the cores at the same level of technology.
Assembler functional benchmark for automotive engine management

This benchmark is a functional benchmark: it is a collection of functions to be executed in an automotive engine management program. It would be preferable to implement the complete control program in assembler and evaluate it in a real hardware environment, but this is not practical, as every implementation requires many man-months to realize. To implement the assembly functional benchmark for automotive engine management correctly, the "rules and details" described in this section have to be followed carefully.

The assembler functional benchmark embraces all activity to be completed in 1 program cycle, which corresponds to 1 engine stroke of 2 ms. The benchmark execution time is calculated as the sum of the products of the function execution times and their occurrence rates in 1 calculation cycle.

Branches are evaluated separately, as "branch penalties" have a considerable effect on program execution efficiency. The estimated (branch count) * (average branch time) is added to the function execution times.

The relative estimated overhead for statistics does not contribute to the evaluation of the speed performance ratios, but it has to be considered when looking at the total execution time required per engine stroke cycle. Therefore the real total execution time is multiplied by the statistics overhead factor (1.2*).

  No.   Function description             Occurrences
  1     16x16 multiply                   12
  2     Floating point divide (16:16)    4
  3     Add/subtract (24)                50
  4     Compare (24)                     13
  5     CAN cmp/mov 10*8                 80
  6     Linear interpolation (8*8)       20
  7     Interrupts                       10
  8     Program control branches         500
  9     Statistics (20%)                 1.2 *

Function parameter allocation

Most functions are very short in execution time, so the function parameter data access method has a great effect on the total time. It therefore has to be considered carefully.
Some cores feature large register files (XA, 80C196) in which variables can be stored; others with few registers (68000) have to store all data in memory. For the XA/80C196 processors, data stored in the lower part of the register file, or in SFRs for I/O, can be accessed using "direct" addressing, but table data, used, e.g., for the 3-byte compare, is stored in "external memory". The 68000 assumes data in memory (or memory-mapped I/O), as not enough data registers are available. All 68000 memory data has to be accessed using long-absolute addressing: 68000 short addresses are relative to memory address 0000 and are therefore not useful. For the more complex functions (16x16 multiply, floating point division and interpolation), data is assumed to be already in registers.

16x16 signed multiply

Parameters are assumed to be in registers, and the 32-bit result is written into a register pair.

Divide (16:16) "floating point"

The floating point division is entered with parameters in registers: a divisor, a dividend and an "exponent" that determines the position of the fraction point in the result. Floating point binary 16/16 division is a function that is normally not included in HLL compilers, as it requires separate algorithms for exponent control and its accuracy is limited. For assembler control algorithms, floating point division can be quite efficient, as it is much faster than normal "real" number calculations (where no "floating point accelerator" hardware is available).

Compare 24-bit variables

Note that 24-bit compare is very efficient for "real" 16-bit (and 8-bit) controllers, but for automotive engine timers, 24-bit seems a good solution. The compare must make it possible to decide >, < or =. For the 68000 and 80C196 instruction sets, LT, EQ and GT are included in the CC after CMP.

CAN move and compares

For service of the CAN serial interface, it is estimated that 40 * (2-byte compares + branch) have to be done. Devices with a 16-bit bus are assumed to use word access. An average branch is included in the CAN compare function.
Linear interpolation (8*8)

The interpolation routine is entered with 3 register parameters:

1. Table position address
2. x fraction
3. y fraction

Using the x fraction, the routine first interpolates the value f(x.x, y) between f(x, y) and f(x+1, y), and the value f(x.x, y+1) between f(x, y+1) and f(x+1, y+1). From f(x.x, y) and f(x.x, y+1), the value f(x.x, y.y) is then interpolated using the y fraction.

The table is organized as 16 linear arrays of 16 x-values, so that any f(x, y) can be accessed with table origin address + x + 16*y = "table position address". In the x-direction the interpolation can be done between the "table position" value and the next position (+1). Interpolation in the y-direction is done by looking at "table position" + 16.

For the linear interpolation time, the 2-dimensional interpolation time and byte count are divided by 3, to include some "overhead" in the linear interpolation.

Interrupts

The average interrupt routine overhead includes the following stages:

a. Interrupt recognition and return
b. 1 * (long) branch
c. 2 * jump (short) on bit
d. 1 * call (long) and subroutine return
e. 2 * set bit and 2 * clear bit
f. 5 * pop and 5 * push (or move multiple) [free 5 registers for local use]
g. 1 * mov #xxx, pst

Program control overheads

For a given algorithm, the program control overhead, consisting of a number of decisions (branches) and subroutine calls, is independent of the instruction set used, except for cases where functions can be replaced by complex instructions. The most important exception cases, word multiply and floating point division, are handled separately in this benchmark.

Most 16-bit cores use more pipeline stages, so that taken branches add a branch time penalty for these CPUs due to the pipeline flush. This effect can be found in the branch execution time tables.
The more efficient data operations and the pipeline penalty of the more complex instruction sets of 16-bit cores lead to a considerably higher relative time used for branch instructions.

To incorporate the influence of branches in the benchmark, the number of branches to be included must be estimated. For byte and bit routines, branches occur more frequently. An average branch time of 25% may be a good guess. For the automotive engine management benchmark, which executes in approximately 5000 µs (on the 8051), this results in about 1250 µs or 625 branches. As a part of the branches is already accounted for in the compare functions, the number of additional program control branches is estimated at 500.

To estimate the average branch execution time, the relative occurrence of the branch types has to be estimated.

Table 6. Estimated relative occurrence of the branch types

  Type                                  Relative occurrence   Absolute occurrence
  Absolute jumps (AJMP/JMP)             20%                   100
  Subroutine calls (ACALL/JSR)          20%                   100
  Jump on condition, rel. (Bcc/Jcc)     40%                   200
  Jump on bit, rel. (JB/JBN)            20%                   100

Statistic routine overheads

Statistic routines are estimated as relative program overheads, only to get an indication of the total processing time required in a real engine management application. "Statistics" are mainly arithmetic routines to determine table corrections. They use about 20% of the total time.
XA benchmark results

The following analysis assumes worst-case operation. At any point in time, only 2 bytes are available in the instruction queue. An instruction longer than 2 bytes requires an additional code read cycle.

Appendix 1: XA function implementations

XA reference: XA User's Manual 1994

16x16 signed multiply

Parameters are assumed to be in registers, and the 32-bit result is written into a register pair. The mul.w r,r is encoded in the XA instruction set as a 2-byte instruction. The exact optimization for this instruction (such as skipping over 1's and 0's) has not been concluded at this point, and the execution time may be data dependent and shorter than the one outlined here. The basic algorithm utilizes 2-bit Booth recoding. Instruction fetch and decode time overlaps the execution of the preceding instruction (except when following a taken branch), so it is ignored. The total execution time is either 11 or 12 clocks, including operand fetch and write back (1 clock is dependent on critical path analysis).

A1.1: 16x16 multiply

                                                    Bytes   Clocks
          mul.w  r0, r1                             2       12   (0.75 µs)

A1.2: Floating point 16x16 divide

The algorithm here follows the one outlined for the 80C196.

Arguments:
  r4 = dividend (extend into r5 for 32 bits)
  r6 = divisor mantissa
  r0 = divisor exponent

                                                    Bytes   Clocks
  fpdiv:  adds   r6, #0     ; add short format      2       3
          beq    l1         ; check for divby0      2       3    (not taken)
  sgnxtd_and_shft:
          sext   r5         ; sign extend into r5   2       3
          asl    r4, r0     ; 13 position shifts    2       11
  div:
          div.d  r4, r6     ; divide 32x16 signed   2       21
          bov    l1         ; branch on overflow    2       6    (taken)
          ret               ; normal termination    2       8
  l1:
          movs   r4, #1     ; overflow max result   2       3    (not executed)
          ret               ;                       2       8
                                                    18      63   (3.94 µs)

A1.3: Extended 32-bit subtract

  ; r5:r4 = minuend
  ; r3:r2 = subtrahend
                                                    Bytes   Clocks
          sub    r4, r2                             2       3
          subb   r5, r3                             2       3
                                                    4       6    (0.38 µs)
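The times in parentheses in the listings above are the clock counts divided by the 16 MHz evaluation frequency. A minimal sketch of that conversion follows; the function name `clocks_to_us` is illustrative, not from the original note.

```python
CLOCK_HZ = 16_000_000  # evaluation frequency used throughout the note


def clocks_to_us(clocks: float, clock_hz: float = CLOCK_HZ) -> float:
    """Convert a clock count to microseconds at the given clock frequency."""
    return clocks / clock_hz * 1e6


if __name__ == "__main__":
    # Clock counts from A1.1 (multiply), A1.2 (divide) and A1.3 (subtract).
    for name, clocks in [("mpy", 12), ("fdiv", 63), ("add/sub", 6)]:
        print(f"{name}: {clocks} clocks = {clocks_to_us(clocks):.3f} µs")
```

The note rounds the results to two or three digits, so 63 clocks (3.9375 µs) is printed as 3.94 µs and 6 clocks (0.375 µs) as 0.38 µs.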
A1.4: Compare 24-bit variables

Only the minimum execution time is considered here. An average branch is included after the compare. The table data, used for the 3-byte compare, is stored in memory.

                                                    Bytes   Clocks
          cmp.b  r1l, r2l   ; direct addressing     2       4
          bne    l1         ; average (6t/3nt)      2       4.5
          cmp.w  r0, mem2   ;                       3       4
  l1:     cmp.w  r0, mem1   ;                       3       4
          bxx    label1     ; average               2       4.5
  label1:                   ; xx > gt or lt or eq
                                                    9       17   (1.06 µs)

A1.5: CAN move and compare

Application: for service of the CAN (Controller Area Network) serial interface it is estimated that 40 * (2-byte compares + branch) have to be done. One parameter is in a register, the other in internal memory. Again, minimum execution times are considered.

                                                    Bytes   Clocks
          cmp    r0, mem0   ;                       3       4
          bxx    label      ; average               2       4.5
                                                    5       9    (0.563 µs)

A1.6: Linear interpolation

Arguments:
  r0 = table base (assumed < 400 hex)
  r2 = fraction 1
  r4 = fraction 2
  r6 = result

                                                    Bytes   Clocks
  lin_int:
          mov    r6, [r0+]  ;                       2       4
          mov    r1, [r0]   ;                       2       3
          sub    r1, r6     ;                       2       3
          mulu.w r6, r2     ;                       2       12
          mov.b  r1h, r1l   ;                       2       3
          movs.b r1l, #0    ;                       2       3
          add    r6, r1     ;                       2       3
          add    r0, #15    ;                       2       3
          mov    r1, [r0+]  ;                       2       4
          mov    r5, [r0]   ;                       2       3
          sub    r5, r1     ;                       2       3
          mulu.w r5, r2     ;                       2       12
          mov.b  r1h, r1l   ;                       2       3
          movs.b r1l, #0    ;                       2       3
          add    r1, r5     ;                       2       3
          sub    r1, r6     ;                       2       3
          mulu.w r1, r4     ;                       2       12
          mov.b  r1h, r1l   ;                       2       3
          movs.b r1l, #0    ;                       2       3
          add    r6, r1     ;                       2       3
          ret               ;                       2       6
                                                    42      95   (5.94 µs)

  Linear interpolation (2-dim. time / 3) = 14 bytes, 1.98 µs
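The 2-dimensional interpolation that the listing above implements can be sketched in a few lines of Python. This is an illustration of the method described in the benchmark section, not a transcription of the XA code: it assumes the table is a flat sequence with 16 entries per row, that the position is x + 16*y, and that the fractions are 8-bit values interpreted as n/256. The names `lerp` and `interpolate_2d` are hypothetical.

```python
def lerp(a: int, b: int, frac: int) -> int:
    """Linear interpolation between a and b; frac is 0..255, read as frac/256.

    Assumption: integer arithmetic with floor rounding (Python's //).
    """
    return a + ((b - a) * frac) // 256


def interpolate_2d(table, position: int, x_frac: int, y_frac: int) -> int:
    """Interpolate in a 16-column table stored row by row.

    position = x + 16*y selects f(x, y); the neighbours used are
    position + 1 (next x), position + 16 (next y) and position + 17.
    """
    row0 = lerp(table[position], table[position + 1], x_frac)        # f(x.x, y)
    row1 = lerp(table[position + 16], table[position + 17], x_frac)  # f(x.x, y+1)
    return lerp(row0, row1, y_frac)                                  # f(x.x, y.y)


if __name__ == "__main__":
    # Hypothetical test table: f(x, y) = 16 * (x + y).
    table = [16 * (x + y) for y in range(16) for x in range(16)]
    print(interpolate_2d(table, 0, 128, 64))
```

Three one-dimensional interpolations are needed for one 2-dimensional result, which is why the note divides the 2-dimensional time and byte count by 3 to obtain the figure for a single linear interpolation.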
A1.7: Interrupt overhead

Note: interrupt overhead, as defined in the benchmark, applies to performance calculations. It does not consider the interrupt latency associated with completing the current instruction. All transfers are to/from internal memory; all addresses are 16 bits long.

  { Saves 2 words on stack                     = 4 clks
    Prefetching ISR                            = 3 clks
    Overhead through interrupt controller      = 3 clks (allow synch + avoid metastability)
    i.e., total                                = 10 clks }

                                                          Bytes   Clocks
  interrupt accept/return                                 0/2     10+8
  jmp    rel16        ; uncond. x 2                       3x2     6x2
  bxx    bit, rel8    ; branch on bit test x 2            2x2     4.5x2
  call   rel16        ; long call (pz assumed)            3       4
  ret                 ; subroutine return                 2       6
  setb   bit          ; set bit x 2                       3x2     4x2
  clr    bit          ; clear bit x 2                     3x2     4x2
  push   rlist (5)    ; 5 push multiple                   2       15
  pop    rlist (5)    ; 5 pop multiple                    2       12
  mov    pswl, #data8 ; imm. byte to pswl                 4       3
  mov    pswh, #data8 ; needs 2 for 8-bit sfr bus         4       3
                                                          41      98   (6.1 µs)

A1.8: Program overhead

Branches are assumed taken 70% of the time; all addresses are external. Code is assumed to be a run-time trace, so code size cannot be calculated; based on the same approach taken for the 80C196, code size is 1400 bytes.

                                                          Bytes   Clocks
  jmp    rel16        ; long branch x 100                 3x100   6 x 100
  call   rel16        ; call x 100 (page 0)               3x50    4 x 50
  ret                 ; subroutine return x 100           2x100   6 x 50
  bxx    rel8         ; condl. short branch x 100         2x200   4.5 x 200
  jb/jnb bit, rel8    ; bit test & branch x 100           2x100   4.5 x 100
                                                          1400    2,450   (153.1 µs)

A1.9: XA totals

  Function   Oc*   Exec. time/funct. (µs)   Oc* x time/funct. (µs)   Bytes/function
  mpy        12    0.75                     9                        2
  fdiv       4     3.94                     15.8                     18
  add/sub    50    0.38                     19                       4
  cmp 24b    13    1.06                     13.78                    9
  CAN 16b    40    0.563                    22.52                    5
  intplin    20    5.94                     118.8                    42
  interr     10    6.1                      61                       41
  branch     10                             153.1

Conclusion: an assumption is made that the XA code is in the first 64K (pz), as the 80196 has a 64K address space only.
Appendix 2: 8051 function implementations

8051 reference: Single Chip 8-bit Microcontrollers PCB83C552 User's Manual 1988

A2.1: 80C51 multiply 16x16

The 80C51 core performs 8-bit multiply only. A 16x16 multiply has to be done by splitting x and y into xh, xl and yh, yl so that:

  p3..p0 = (xh*256+xl)*(yh*256+yl) = xh*yh*65536 + (xh*yl+xl*yh)*256 + xl*yl

                          Clocks   Bytes
  mpy:    mov   r1,xh     2        3
          mov   r2,xl     2        3
          mov   r3,yh     2        3
          mov   r4,yl     2        3
          mov   a,r2      1        1      ; xl
          mov   b,r4      1        3      ; yl
          mul   ab        4        1
          mov   p0,a      1        2      ; lowest multiply result byte
          mov   a,r4      1        1      ; yl
          mov   r4,b      2        3      ; xl*yl upper byte (*256)
          mov   b,r1      2        3      ; xh
          mul   ab        4        1      ; xl*yl
          add   a,r4      1        1
          mov   r4,a      1        1      ; upper (xl*yl)+lower(xh*yl) in r2
          mov   a,b       1        2
          addc  a,#0      1        2
          xch   a,r2      1        1      ; xl upper (xh*yl) in r2
          mov   b,r3      3        2      ; yh
          mul   ab        4        1      ; xl*yh
          add   a,r4      1        1
          mov   p1,a      1        2
          mov   a,b       1        2
          addc  a,r2      1        1
          mov   r2,a      1        1
          mov   a,r3      1        1
          mov   b,r1      2        3
          mul   ab        4        1
          add   a,r2      1        1
          mov   p2,a      1        2
          mov   a,b       1        2
          addc  a,#0      1        2
          mov   p3,a      1        2
  Total                   50       58

  50 machine cycles = 50*12 = 600 clocks (37.5 µs @ 16.0 MHz)

  8051 mpy 16x16 (mpy bytes): 50 clocks = 37.5 µs / 58 bytes
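The decomposition used by the 8051 routine above can be checked independently of the assembler. The sketch below builds the 32-bit product from the four 8x8 partial products exactly as in the formula for p3..p0; the function names `mul16_via_8bit` and `product_bytes` are illustrative and not part of the original note.

```python
def mul16_via_8bit(x: int, y: int) -> int:
    """Multiply two unsigned 16-bit values using only 8x8-bit products."""
    xh, xl = (x >> 8) & 0xFF, x & 0xFF
    yh, yl = (y >> 8) & 0xFF, y & 0xFF
    # Same terms as the formula: xh*yh*65536 + (xh*yl + xl*yh)*256 + xl*yl
    return xh * yh * 65536 + (xh * yl + xl * yh) * 256 + xl * yl


def product_bytes(x: int, y: int) -> list[int]:
    """Return the result bytes [p0, p1, p2, p3], lowest byte first."""
    product = mul16_via_8bit(x, y)
    return [(product >> (8 * i)) & 0xFF for i in range(4)]


if __name__ == "__main__":
    print(hex(mul16_via_8bit(0x1234, 0x5678)))
    print(product_bytes(0x0102, 0x0304))
```

Each of the four partial products corresponds to one `mul ab` in the listing; the remaining instructions only move operands through the accumulator and propagate carries between the result bytes.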
A2.2: 8051 divide (16/16) "floating point"

Divide (r6, r7) (dividend) by (r4, r5) (divisor) with (r0) bits after the fraction point. Alignment of the MS bits of the operands in r6.7 and r4.7 uses r0 as bit counter.

                            Clocks   Bytes
  fdv:    inc   r0          1        1
          inc   r0          1        1
          mov   r3,#0       1        2
          mov   r2,#0       1        2
          clr   c           1        1
          clr   f0          1        2
          mov   a,r4        1        1
          jb    acc.7,l2    2        3
          jnz   l1          2        2
          mov   a,r5        1        1
          jz    lx          2        2
  l1:     mov   a,r5        1        1
          rlc   a           1        1
          mov   r5,a        1        1
          mov   a,r4        1        1
          rlc   a           1        1
          mov   r4,a        1        1
          inc   r0          1        1
          jnb   acc.7,l1    2        3
  l2:     mov   a,r6        1        1
          jb    acc.7,l6    2        3
  l3:     mov   a,r7
          rlc   a           1        1
          mov   r7,a        1        1
          mov   a,r6        1        1
          rlc   a           1        1
          mov   r6,a        1        1
          djnz  r0,$+4      2        2
          ajmp  lx          2=0      3
          jnb   acc.7,l3    2        3
          ajmp  l6          2        3
  l4:     mov   a,r3
          rlc   a           1        1
          mov   r3,a        1        1
          mov   a,r2        1        1
          rlc   a           1        1
          mov   r2,a        1        1
          jnc   l5          2        2
          mov   r2,#0ffh    1        1
          mov   r3,#0ffh    1        1
          sjmp  lx          1        1
  l5:     clr   c           1        1
          mov   a,r7        1        1
          rlc   a           1        1
          mov   r7,a        1        1
          mov   a,r6        1        1
          rlc   a           1        1
          mov   r6,a        1        1
          jnc   l5          1        1
          mov   f0,c        1        2
  l6:     clr   c           1        1
          mov   a,r7        1        1
          subb  a,r4        1        1
          jnc   l7          2        2
          jnb   f0,l8       2        3
          cpl   c           1        1
  l7:     mov   r6,a        1        1
          mov   a,r1        1        1
          mov   r7,a        1        1
  l8:     cpl   c           1        1
          djnz  r0,l4       2        2
          mov   a,r3        1        1
          add   a,#0        1        1
          mov   r3,a        1        1
          mov   a,r2        1        1
          add   a,#0        1        2
          mov   r2,a        1        1
  lx:     ret               2        1

  Total 96 bytes; 13 branch instructions (= 35 bytes == 36%)

Timing: 3 divide cases (subtracts / shifts / total / average)

  1. r0=0e, 8 bit/14 bit  >  158+2=9   8+2=9   32 subtracts   11
  2. r0=08, 12 bit/14 bit >  84+4=8    4+4=8   17+11 shifts   6+4
  3. r0=10, 11 bit/12 bit >  165+4=15  5+5

  17+4*9+6*10+(15.5+10*31.5)+8 = 451.5 clocks = 338.6 µs

  8051 ufdiv 16/16 (sub/sft): 451.5 clocks = 338.6 µs, 96 bytes

A2.3: 8051 add/sub

                          Clocks   Bytes
  ads:    clr   c         1        1
          mov   a,x0      1        2
          subb  a,y0      1        2
          mov   z0,a      1        2
          mov   a,x1      1        2
          subb  a,y1      1        2
          mov   z1,a      1        2
          mov   a,x2      1        2
          subb  a,#0      1        2
          mov   z2,a      1        2
                          10       19

  8051 add/sub in reg file: 10 clocks = 7.5 µs, 19 bytes

8051 cmp enabling jz jnz jc jnc

The 8051 decisions made with branches are one of these three:

                          Clocks   Bytes
          jc    lt        2        2

          jc              2        2
          jz    eq        2        2

          jc              2        2
          jnz   gt        2        2

  8051 compare decision branches take on average: 10/3 clocks => 2.5 µs
A2.4: 8051 cmp 3-byte compare

                          Clocks   Bytes
  cm3:    clr   c         1        1
          mov   a,x2      1        2
          subb  a,y2      1        2
          mov   r0,a      1        2
          mov   a,x1      1        2
          subb  a,y1      1        2
          orl   r0,a      1        2
          mov   a,x2      1        2
          subb  a,y2      1        2
          orl   a,r0      1        2
          jcc   xxxx      3.33     3.33
                          10       19

  8051 cmp 3-byte data in reg file: 13.3 clocks = 9.975 µs, 22.3 bytes

A2.5: 8051 2-byte CAN compares

                            Clocks   Bytes
  can:    mov   dptr,ax1    2        3      ; one compare src in xram
          movx  a,@dptr     1        2
          cjne  a,y1        1        2
          mov   dptr,ax2    1        2      ; one compare src in xram
          movx  a,@dptr     1        2
          cjne  a,y2        2        3
                            12       14

  8051 CAN cmp xram/direct: 9 µs, 14 bytes
A2.6: 8051 2-dimensional interpolation

At the start the registers are prepared:

  a      : position in table (x+16*y)
  dptr   : start address of table (aligned at 256-byte boundary)
  r0     : x fraction
  r1     : y fraction
  result : acc
  registers used : acc, r0, r1, r2, r4, r5, r6

                            Clocks   Bytes
  int:    mov   dpl,a       1        2      ; pos x,y
          acall gval        2        2
          mov   r4,a        1        1
          mov   a,dpl       1        2
          add   a,#15       1        2
          mov   dpl,a       1        2
          acall gval        2        2
          mov   reg6,r4     1        2
          mov   b,r1        1        2
          acall intp        1        2
          ret               2        1
  gval:   movx  a,@dptr     2        1
          mov   r6,a        1        1
          inc   dpl         2        1
          movx  a,@dptr     2        1
          mov   b,r0        1        2
  intp:   clr   sf          1        2
          clr   c           1        1
          subb  a,r6        1        1
          jnc   int1        2        2
          cpl   a           1        1
          inc   a           1        1
          setb  sf          1        2
  int1:   mul   ab          4        1
          xch   a,b         1        2
          clr   c           1        1
          rrc   a           1        1
          xch   a,b         1        2
          xch   a,b         1        2
          clr   c           1        1
          rrc   a           1        1
          xch   a,b         1        2
          jb    sf,int2     2        3
          addc  a,r6        1        2
          ret               2        1
  int2:   xch   a,r6        1        2
          subb  a,r6        1        2
          ret               2        1

  Total 2-dim. interpolation: 15+2*(8+24)+24 = 103 clocks = 77.25 µs, 59 bytes

  8051 linear interpolation: (2-dim. intp. time / 3) = 103/3 clocks = 25.75 µs, 20 bytes
A2.7: 8051 interrupt overhead

                            Clocks   Bytes
  a.  interrupt (vector)    2        2
      reti                  2        1
  b.  ajmp   2*             4        4
  c.  jb     2*             4        6
  d.  acall                 2        2
      ret                   2        1
  e.  setb   2*             2        4
      clrb   2*             2        4
  f.  pop    5*             10       10
      push   5*             10       10
  g.  mov    1*             2        2
                            42       46

  8051 interrupt overhead: 42 clocks = 31.5 µs

A2.8: 8051 program overhead

  Type         Occurrence   8051 cycles each   Cycles total   Bytes each   Bytes total
  ljmp/jmp     100          2                  200            3            300
  lcall/jsr    100          2                  200            3            300
  jcc/bcc      200          2                  400            3            600
  jb/jbn       100          2                  200            3            300

  Total: 1000 cycles, 750 µs, 1500 bytes

A2.9: 8051 totals

  Function      Oc*   Exec. time/funct. (µs)   Oc* x time/funct. (µs)
  1. mpy        12    37.5                     450
  2. fdiv       4     338.6                    1354.4
  3. add/sub    50    7.5                      375
  4. cmp 24b    13    9.98                     129.74
  5. CAN 16b    40    9                        360
  6. intplin    20    25.8                     516
  7. interr     10    31.5                     315
  8. branch     10                             750

  8051 totals:                4250.14 µs
  Including 20% statistics:   5,100.2 µs
Appendix 3: 68000 implementations

68000 reference: SC68000 Microprocessor User's Manual (Motorola copyright; Philips edition 12NC: 4822 873 301 16)

A3.1: 68000 16x16 multiply

The 68000 can do this with one MUL instruction and move a long-word result.

                            Bytes   Clocks
          mul   r0,r1       2       70

  Total: 4.375 µs, 2 bytes

A3.2: Floating point division 16:16

(r0) accuracy; (r1)/(r2), result in r1

                            Bytes   Clocks
  fdv:    ext.l r1          2       4
          tst   r2          2       4
          beq   l1          2       10/8
          asl   r0,r1       2       32
          divu  r2,r1       2       140
          bvc   l2          2       10/8
  l1:     movi  #1,r1       2       4
  l2:     rts               2       16

  Total: 214 clocks or 13.375 µs, 16 bytes

A3.3: Add/sub

                            Bytes   Clocks
  adds:   mov.l a,r0        6       20
          add.l r0,c        6       48

  Total: 44 clocks or 2.75 µs, 12 bytes

A3.4: Compares 24 (=32) bit

                              Bytes   Clocks
  cmpl:   mov.l x,r0          6       20
          cmp.l y,rn          6       22
          blt/eq/gt (av)      2       9

  Total: 51 clocks or 3.19 µs, 14 bytes

A3.5: CAN move and compares (16-bit)

                              Bytes   Clocks
  cmpw:   mov.w x,r0          6       16
          cmp.w y,rn          6       18
          blt/eq/gt (av)      2       9

  Total: 43 clocks or 2.69 µs, 14 bytes
A3.6: 2-dimensional interpolation

a0 : table position, r0 : fraction 1, r1 : fraction 2, r2 : result, r3, r4

                              Bytes   Clocks
  cmpw:   mov.w  (a0),r2      2       8
          addq.l #1,a0        2       8
          mov.l  (a0),r3      2       8
          sub.w  r2,r3        2       4
          mulu   r0,r3        2       74
          asr.l  #8,r3        2       28
          add.w  r3,r2        2       4
          addi.l #15,a0       4       8
          mov.w  (a0),r3      2       8
          addq.l #1,a0        2       8
          mov.w  (a0),r4      2       8
          sub.w  r3,r4        2       4
          mulu   r0,r4        2       74
          asr.l  #8,r4        2       28
          add.w  r4,r3        2       4
          sub.w  r2,r3        2       4
          mulu   r1,r3        2       40
          asr.l  #8,r3        2       22
          add.w  r3,r2        2       4
          rts                 2       16

  Total: 362 clocks or 22.62 µs, 42 bytes

  Linear interpolation is 2-dim. interpolation / 3: 1-dim. interpolation 7.54 µs, 14 bytes

A3.7: 68000 interrupt overhead

                            Clocks   Bytes
  a.  interrupt             44       4
      reti                  20       2
  b.  jmp        2*         24       24
  c.  btst+bne   2*         60       16
  d.  bsr                   18       4
      rts                   16       2
  e.  bset/bclr  4*         96       24
  f.  movem      2* n=5     64       12
  g.  movi #xx,ccr          8        4
                            350      92

  68000 interrupt overhead: 350 clocks = 21.87 µs, 92 bytes
A3.8: 68000 program overhead

For the 68000, the jb/jbn branches have to be constructed:

                                Clocks   Bytes
          mov.w   abs.l,rn      12       6
          andi.w  #bitmask,rn   8        4
          beq/bne rel.address   10       2

  Total jb/jnb execution: 34 clocks, 12 bytes

Now the absolute (estimated) branch time can be calculated, taking the core difference into account.

  Type         Occurrence   68000 clocks each   Clocks total   Bytes each   Bytes total
  ljmp/jmp     100          12                  1200           6            600
  lcall/jsr    100          20                  2000           8            800
  jcc/bcc      200          10                  2000           2            400
  jb/jbn       100          34                  3400           12           1200

  Total: 8600 cycles, 537.5 µs, 3000 bytes

A3.9: 68000 totals

  Function      Oc*   Exec. time/funct. (µs)   Oc* x time/funct. (µs)
  1. mpy        12    4.4                      52.8
  2. fdiv       4     13.4                     53.6
  3. add/sub    50    2.75                     137.5
  4. cmp 24b    13    3.2                      41.6
  5. CAN 16b    40    2.7                      216
  6. intplin    20    7.5                      150
  7. interr     10    21.9                     219
  8. branch     10                             537.5

  68000 totals:               1,300 µs
  Including 20% statistics:   1,560 µs
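The program overhead figure in A3.8 is the branch mix of Table 6 weighted by the per-branch clock counts of the core. A minimal sketch of that calculation for the 68000 follows, using the occurrences and clock counts printed in the A3.8 table; the names `branch_mix`, `clocks_68000` and `branch_overhead` are illustrative, not from the original note.

```python
CLOCK_MHZ = 16.0  # evaluation frequency used throughout the note

# Absolute branch occurrences (Table 6) and 68000 clocks per branch (A3.8).
branch_mix = {"ljmp/jmp": 100, "lcall/jsr": 100, "jcc/bcc": 200, "jb/jbn": 100}
clocks_68000 = {"ljmp/jmp": 12, "lcall/jsr": 20, "jcc/bcc": 10, "jb/jbn": 34}


def branch_overhead(mix: dict, clocks_per_branch: dict) -> tuple[int, float]:
    """Return (total clocks, total time in µs) for the given branch mix."""
    total_clocks = sum(mix[kind] * clocks_per_branch[kind] for kind in mix)
    return total_clocks, total_clocks / CLOCK_MHZ


if __name__ == "__main__":
    clocks, time_us = branch_overhead(branch_mix, clocks_68000)
    print(f"68000 program overhead: {clocks} clocks = {time_us} µs")
```

The same function can be applied to another core by substituting that core's per-branch clock counts, which makes the sensitivity of the total to the assumed branch mix easy to explore.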
Appendix 4: 80C196 function implementations

80C196 reference: Embedded Controller Handbook Vol. II, 16-bit. Copyright: Intel Corp.

A4.1: 80C196 unsigned multiply p=x*y (16x16)

                            Bytes   Clocks
          mul   r0,r1       3       28

  Total: 1.75 µs, 3 bytes

A4.2: Floating point division 16:16

(r0) accuracy; (r4)/(r8), result in r4

                            Bytes   Clocks
  fdv:    ext   r4          2       4
          and   r8,#ffff    4       5
          je    l1          2       8/4
          shll  r4,r0       3       20
          divu  r8,r4       3       24
          jnv   l2          2       4/8
  l1:     ld    r4,#ffff    2       5
  l2:     ret               1       11

  Total: 76 clocks or 9.5 µs, 19 bytes

A4.3: Add/sub

                            Bytes   Clocks
  adds:   sub   r5,r1,r3    3       5
          subb  r4,r0,r2    4       5

  Total: 10 clocks or 1.25 µs, 7 bytes

A4.4: 80C196 "3-byte compare"

                              Bytes   Clocks
          cmp   rn,y1         5       9
          bne   l1            2       4/8
          cmp   rm,y2         5       9
  l1:     blt/eq/gt (av)      2       4/8

  Average total: 34 clocks or 4.25 µs, 14 bytes

A4.5: CAN move and compares (16-bit)

                              Bytes   Clocks
          cmp   rx,y          4       9
          blt/eq/gt (av)      2       6

  Total: 15 clocks or 2.5 µs, 6 bytes
A4.6: 80C196 2-dimensional interpolation using in-line linear interpolations

r0 : table position, r2 = fraction 1, r4 = fraction 2, r6 = result, r8, r10

                            Bytes   Clocks
          ld    r6,[r0]+    3       6
          ld    r8,[r0]+    3       5
          sub   r8,r6       3       4
          mulu  r8,r2       3       14
          shral r8,#8       3       15
          add   r6,r8       3       4
          add   r0,#15      4       6
          ld    r8,[r0]+    3       6
          ld    r6,[r0]     3       5
          sub   r10,r8      3       4
          mulu  r10,r2      3       14
          shral r10,#8      3       15
          add   r8,r10      3       4
          sub   r8,r6       3       4
          mulu  r8,r4       3       14
          shral r8,#8       3       15
          add   r6,r8       3       4
          ret               1       14

  Total: 153 clocks or 19.1 µs, 53 bytes

  Linear interpolation is 2-dim. interpolation / 3: 1-dim. interpolation 6.4 µs, 18 bytes

A4.7: 80C196 interrupt overhead

                            Clocks   Bytes
  a.  interrupt/rte         27       2
  b.  ljmp       2*         14       6
  c.  jb         2* av.7    14       6
  d.  call/rts              22       4
  e.  bset/bclr  4*         28       16
  f.  pop        5*         40       10
      push       5*         55       10
  g.  movi #xx,ccr          5        4
                            205      58

  80C196 interrupt overhead: 205 clocks = 12.8 µs, 58 bytes
A4.8: 80C196 program overhead

  Type         Occurrence   80C196 clocks each   Clocks total   Bytes each   Bytes total
  ljmp         100          7                    700            3            300
  lcall/ret    100          22                   2200           4            400
  jcc/bcc      200          7                    1400           2            400
  jb/jbn       100          7                    700            3            300

  Total: 6000 cycles, 375 µs, 1400 bytes

80C196 totals

  Function      Oc*   Exec. time/funct. (µs)   Oc* x time/funct. (µs)
  1. mpy        12    1.75                     21
  2. fdiv       4     9.5                      38
  3. add/sub    50    1.25                     62.5
  4. cmp 24b    13    4.25                     55.2
  5. CAN 16b    40    1.88                     150.4
  6. intplin    20    6.4                      128
  7. interr     10    12.8                     128
  8. branch     10                             375

  80C196 totals:              958.1 µs
  Including 20% statistics:   1150 µs
bit manipulation

copy a bit from one location to another in memory. complement the bit in the new location.

note: assumed that memory is on-chip and directly addressed. bit "x" of mem0 needs to be copied to bit "y" of mem1.

xa

                                                              bytes   clocks
      clr   c           ; clear carry                         3       4
      orl   c, /bitm    ; compl. bit and save in c            3       4
      mov   bitn, c     ; move mem0.x -> mem1.y               3       4
                                                              9       12 (0.75 µs)

intel 80c196

note: states = clock (period) / 2

move complement of bit "m" to "n" in memory
r3 = memory byte having bit "m"
r4 = memory byte having bit "n"
r0 = used as bit-mask register
r1 = position of "m" in mem0
r2 = position of "n" in mem1

                                                              bytes   states
      ld    r0, 1       ; load 1 in reg
      shlb  r0, r2      ; position of bit "n" in r2           3       16
      notb  r0          ; complement                          2       4
      jbc   r3,bitm, l1 ; test bit "m" polarity               3       7 (av)
      andb  r4, r0      ; reset "n" if "m" = 0                3       4
l1:   orb   r4, r0      ; set "m" otherwise                   3       either/or
                                                              14      31 (3.88 µs)

motorola 68000

                                                              bytes   states
      btst  bitm        ; test bit                            2       4
      beq   l1          ; branch if reset                     2       6
      bclr  bitn        ; test bit and clear (~m = 0)         2       4
      .......
l1:   bfset bitn        ; test bit and set (~m = 1)           2       either/or
                                                              8       14 (0.88 µs)

8051 bit-test

                                                              bytes   clocks
      mov   c, bitm                                           2       12
      cpl   c                                                 1       12
      mov   bitn, c                                           2       24
                                                              5       48 (3.0 µs)
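The microsecond figures in the bit-manipulation listings can be reproduced from the cycle counts at the 16 MHz evaluation frequency. The sketch below is an illustration, not part of the original note: it assumes one 80c196 state lasts two 16 MHz clock periods, which is the value the printed 3.88 µs implies, and the helper name `cycles_to_us` is made up here.

```python
CLOCK_MHZ = 16.0  # all cores are evaluated at 16 MHz in this note


def cycles_to_us(count, clocks_per_count=1):
    """Convert a cycle (or state) count to microseconds at CLOCK_MHZ."""
    return count * clocks_per_count / CLOCK_MHZ


bit_test_us = {
    "xa": cycles_to_us(12),
    "68000": cycles_to_us(14),
    # Assumption: one 80c196 state = 2 clock periods (implied by 3.88 us).
    "80c196": cycles_to_us(31, clocks_per_count=2),
    "8051": cycles_to_us(48),
}

for core, microseconds in bit_test_us.items():
    print(f"{core}: {microseconds:.3f} us")
```

The unrounded results (0.750, 0.875, 3.875 and 3.000 µs) agree with the rounded values printed next to each listing.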
xa code density results

graph showing performance with respect to 68000 and 80c196 cores, normalized with respect to xa. the 80c51 is included just for reference.

            xa     68000   80c196   8051
mpy         1      1       1.5      1
fdiv        1      0.89    1.06     5.33
add/sub     1      3       1.75     2.5
cmp 24b     1      1.6     1.6      1
can 16b     1      2.8     1.2      1.5
intplin     1      0.33    0.43     0.33
interr      1      2.24    1.41     1.71

[figure su00599a: bar chart of code density per function (mpy, fdiv, add/sub, cmp 24b, can 16b, intplin, interr) for xa, 68000 and 80c196; vertical axis 0 to 3.0]
xa execution time results

graph showing performance with respect to 68000 and 80c196 cores, normalized with respect to xa. the 80c51 is included just for reference.

            xa     68000   80c196   8051
mpy         1      5.87    2.33     50
fdiv        1      3.4     2.41     86
add/sub     1      7.2     3.3      19.74
cmp 24b     1      3.02    4        9.41
can 16b     1      4.8     4.44     15.98
intplin     1      1.26    1.08     4.34
interr      1      3.6     2.1      5.16

[figure su00600a: bar chart of execution time per function (mpy, fdiv, add/sub, cmp 24b, can 16b, intplin, interr) for xa, 68000 and 80c196; vertical axis 0 to 8]
bit test benchmark: code density normalized with xa (=1.0)

the 80c51 is shown here only for reference.

                 xa     68000   80c196   8051
code density     1      0.89    1.6      0.6

[figure su00601: bar chart of code density for xa, 68000, 80c196 and 80c51; vertical axis 0 to 2.0]

bit test benchmark: execution time normalized with xa (=1.0)

the 80c51 is shown here only for reference.

                 xa     68000   80c196   8051
execution time   1      1.2     5.2      4

[figure su00602: bar chart of execution time for xa, 68000, 80c196 and 80c51; vertical axis 0 to 6]
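The normalized bit-test figures can be checked against the byte counts and execution times printed in the bit manipulation listings. The following Python sketch divides each value by the xa value; the function name `normalize` is illustrative and the inputs are the totals from those listings.

```python
# Totals from the bit manipulation listings.
code_bytes = {"xa": 9, "68000": 8, "80c196": 14, "8051": 5}
exec_time_us = {"xa": 0.75, "68000": 0.88, "80c196": 3.88, "8051": 3.0}


def normalize(values, baseline="xa"):
    """Express every entry relative to the baseline core (baseline = 1.0)."""
    base = values[baseline]
    return {core: value / base for core, value in values.items()}


density = normalize(code_bytes)
exec_time = normalize(exec_time_us)

for core in code_bytes:
    print(f"{core}: density {density[core]:.2f}, time {exec_time[core]:.2f}")
```

Rounded to the precision used in the tables, the ratios match the printed values: values above 1.0 mean more bytes or more time than the xa.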
